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Abstract — The growing amount of applications that gen- 
erate vast amount of data in short time scales render the 
problem of partial monitoring, coupled with prediction, 
a rather fundamental one. We study the aforementioned 
canonical problem under the context of large-scale monitor- 
ing of communication networks. We consider the problem 
of selecting the "best" subset of links so as to optimally 
predict the quantity of interest at the remaining ones. 
This is a well know NP-hard problem, and algorithms 
seeking the exact solution are prohibitively expensive. We 
present a number of approximation algorithms that: 1) their 
computational complexity gains a significant improvement 
over existing greedy algorithms; 2) exploit the geometry of 
principal component analysis, which also helps us establish 
theoretical bounds on the prediction error; 3) are amenable 
for randomized implementation and execution in parallel 
or distributed fashion, a process that often yields the 
exact solution. The new algorithms are demonstrated and 
evaluated using real-world network data. 



I. Introduction 



A. Motivation 



ADVANCES in high-throughput technologies for 
information collection have led to unprecedented 
growth of data that need to be analyzed and interpreted. 
Resource constraints, however, impose limitations on 
the amount of data that can be processed. For exam- 
ple, in network monitoring, limited bandwidth prevents 
all data from being sent to a centralized coordina- 
tor (e.g., Cisco's Netflow Collector [fl~)). In wireless 
sensor applications, such as environmental monitoring, 
sensors' power constraints dictate a wise utilization of 
the available resources. Therefore, designing efficient 
algorithms for information learning through prediction 
is an important problem that needs to be tackled. This 
modeling approach allows for online prediction of the 
information of interest, by utilizing measurements of a 
much smaller set of variables. Obviously, the accuracy 
of the prediction heavily depends on the selection of 
variables to be monitored. One aims to minimize the 
uncertainty of the information learned. This uncertainty 
is usually expressed via the covariance matrix of the 
prediction error. This is a canonical design problem 
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with various applications such as, optimal monitoring of 
computer networks for identification of traffic anoma- 
lies 0, stock market prediction J3], sensors' placement 
for environmental monitoring (see ||4] and references 
therein), transportation network design @, etc. In this 
study, we develop fast algorithms addressing the problem 
at hand in the context of network monitoring; namely, 
how to select K network links, so as to "best" predict 
link utilization on the remaining ones (see Figures [8] 
and HJl. 

Online monitoring for detection and classification of 
anomalies on a computer network's traffic is critical 
for its smooth operation, especially in the presence of 
new protocols {e.g., peer-to-peer) and applications such 
as social networking and cloud computing. In |2|, the 
authors propose a method based on principal component 
analysis (PCA) for anomaly detection by classifying 
traffic measurements into the categories of normal or 
anomalous. In J6|, a scheme for fault detection based on 
the distributional characteristics of traffic is introduced 
that is able to capture spatial and temporal abnormal 
activities. In both studies, periodic traffic flow mea- 
surements must be collected on a large set of links, a 
costly and computationally challenging task. In Q, this 
problem is partly alleviated by considering measuring 
aggregate traffic coupled with a sampling mechanism 
that significantly reduces the data collection and pro- 
cessing overhead. 

Global monitoring of large-scale networks, however, 
involves large traffic volumes, and becomes impractical 
and often impossible due to resource constraints. Nev- 
ertheless, using recent work on global traffic modeling 
|H1, network prediction and kriging (9), iflOl one can 
develop an accurate statistical model of the evolution 
of traffic flows across an entire network by monitoring 
only a subset of network links. Namely, by taking 
into account the routing of traffic flows and the long- 
range dependence in time of traffic traces, one obtains a 
statistical model for the dependence between the traffic 
loads on all links in the network. This model can then 
be exploited for optimal traffic prediction. 

To informally introduce the problem under study, 
consider the toy example of Fig.Q]based on the Internet2 
network ifTTl . The active links consist of the highlighted 
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Fig. 1. A toy example on a subnetwork of the Internet2 topology. 
For the real-world application on the full topology see Section Ivnl 



ones. Suppose we have the following 5 network flows 
with the same statistical characteristics, symbolized as 
(S, D) pairs where S is the source node and D the 
destination: (SALT, NEWY), (SALT, ATLA), (HOUS, 
NEWY), (HOUS, ATLA) and (KANS, ATLA). Assum- 
ing that we have specified our favorite criterion for mea- 
suring the prediction uncertainty, we ask the following 
question: Should we had to choose only two links so as 
to optimally predict the traffic at the remaining ones what 
would the chosen set be? One can easily guess that the 
first member of that set should be link c. This is because 
link c serves the most flows; indeed, link c belongs to the 
optimal set (in general, more than one might exist). The 
other link can be either d or e. Finally, the less obvious 
choice {d, e} is also an optimal solution. 

The situation becomes trickier if we assign link 
preferences. For example, a network operator could 
face different scenarios and may assign distinct link 
weights/priorities, such as monitoring links which can 
be more error prone, or links that are more susceptible 
to network anomalies, or even not wanting to observe 
inexpensive links, etc. In our previous network topology, 
link a is inserted into the observed set - together with 
link c - when its priority elevates. To make matters 
more complicated, assume that flows are correlated and 
have different traffic characteristics. Moreover, add the 
requirement that these decisions should be made online 
and fast, since network conditions and/or operators' 
preferences change frequently. Amid all these, add the 
fact that in real-life we have to deal with hundreds or 
thousands of links and traffic flows. One can then see a 
complete picture of how hard this large-scale monitoring 
problem can become, and that efficient approximation 
algorithms could be very appealing to network managers. 

B. Contributions 

We study the fundamental problem of finding the 
subset of links to monitor which yields best overall 
prediction for the remaining ones. The objective is, given 
a "budget" of K links to be selected (out of a total of 
L), to determine the optimal set of them in order to 
predict the network traffic on the unobserved ones with 
minimum uncertainty. We formulate a canonical model 
(see Section fill- All which we then tailor for tackling the 



challenging problem of large-scale network monitoring. 
We emphasize that our algorithms can be readily applied 
to any other optimal learning problem that fits under the 
framework of Section IIII-AI In the canonical problem, 
we aim to minimize a functional of the prediction error, 
which represents the uncertainty on the information 
that is learned. There are several popular criteria for 
measuring the uncertainty, such as entropy (also known 
as D-optimality, see lfl2l ) of the design matrix, mutual 
information [4], trace (A-optimality) and spectral norm 
(E-optimality) lfl3ll . In this work, we measure the quality 
of the prediction by investigating the criteria of A- and 
E-optimality. 

Optimal prediction belongs to the category of sub- 
set selection problems, which are combinatorial, NP- 
hard problems. For example, min- and max-entropy (D- 
optimality) sampling are shown to be NP-hard problems 
in lfl2l by reduction from the well-known NP-hard prob- 
lems of CLIQUE and STABLE SET Q4). NP-hardness 
of our problem remains in open problem. However, 
a proof of NP-hardness for an analogous problem is 
given in the Appendix. In principle, exact solutions can 
be obtained by using an integer program formulation. 
However, the problem becomes prohibitively expensive 
for even moderate-size datasets. For example, an ex- 
haustive search implementation in CPLEX running on a 
medium-sized computing cluster required several hours 
to complete the solution for a 26 variables problem with 
a budget of 12. Other methods, such as the mixed-integer 
program with constraint generation proposed in lfj"5l . 
are also computationally expensive and impractical. The 
latter algorithm requires also submodularity for the ob- 
jective set function. Unfortunately, submodularity, which 
is similar to the convexity property for real valued 
functions, is not satisfied for all the objective criteria 
of interest. Branch and bound methods could be ap- 
plied (see lfl2l '). but the huge number of "branching" 
choices makes the procedure unattractive for online im- 
plementations in large-scale networks. These challenges 
emphasize the need for approximate, but fast heuristic 
algorithms for solving the optimal prediction problem. 

The main contributions of our study are: 

1) We introduce new heuristic algorithms that exhibit 
superior computational complexity to existing ones. 
The complexity of our algorithms ranges from 
0(Kn 2 ) to 0(Kn 3 ), a significant improvement 
over existing greedy algorithms with 0(Kn A ) com- 
plexity, where n represents the total number of 
variables. These efficiency gains allow one to solve 
the optimal design problem in a dynamic, online 
fashion. 

2) By exploiting the geometry of the objective func- 
tions used, together with connections to PCA we 
obtain prediction error bounds that do not require 
submodularity of the objective functions as in exist- 



ing work (e.g. see (01). Furthermore, these bounds 
help us assess the quality of our approximate solu- 
tion with respect to the optimal solution. 
3) Randomized versions of the proposed algorithms 
can be implemented in a parallel or distributed fash- 
ion. This leads to further improvements in practice, 
often yielding the optimal solution (see Fig. |3(a)[ ). 

The remainder of the paper is organized as follows: in 
Section HI1 we outline the related work. In Section [TTT1 we 
formally define the canonical problem, followed by the 
network monitoring problem. In Section |IV] we discuss 
the connections with PCA and derive our PCA-based 
lower bound on the performance of our algorithms. Sec- 
tion [V] introduces and analyzes our three approximation 
algorithms, and concludes with possible extensions to 
their randomized versions. Section |VT] discusses theo- 
retical performance guarantees for the error reduction 
achieved at each step. Next, in Section [VTll we evaluate 
our methods in a variety of network scenarios, including 
the real-world dataset obtained from the Internet2 topol- 
ogy lOTl . We also explain how a network operator can 
choose the budget K, and how weighted link monitoring 
can be achieved. 

II. Literature Survey 

The combinatorial optimization problem at hand - 
formally introduced in Section IIII-AI see Eq. (O - 
belongs to the family of subset selection problems, and 
is encountered in scientific areas ranging from computer 
science and engineering to statistics and linear algebra. 
For example, in lfl5l . a bank location problem is for- 
mulated as an integer program and algorithms based on 
branch and bound techniques are employed to obtain 
the exact solution. In fl6], approximation algorithms are 
studied that exploit the submodularity property of the 
objective function. Polynomial-time, greedy heuristics 
are offered that yield a solution within (1 — 1/e) of the 
optimum, where e is the base of the natural logarithm. 

In J4|, near-optimal sensor placement for temperature 
monitoring is examined. Temperatures at different lo- 
cations are modeled with a multivariate Gaussian joint 
distribution. The objective is to place a given small num- 
ber of sensors so as to optimally predict the temperature 
at other locations. The aim is to maximize the mutual 
information criterion and, thus, the problem is a variant 
of ©. The authors exploit submodularity and propose 
heuristics that outperform the classical implementation 
of the greedy algorithm. However, these heuristics are 
faster than classical greedy only in special cases such 
as when the covariance matrix of the joint Gaussian 
distribution has "low-bandwidth'Lj. The structure of the 
covariance matrix is also exploited in IPTl where the 

'A covariance matrix E has bandwidth /3 when the variables can be 
ordered in such a way that £;j = when \j — i\ > /3. 



problem of subset selection for regression is studied. 
This is similar in spirit to our problem under the trace 
criterion. The authors propose fast, exact algorithms 
based on dynamic programming, which apply, however, 
only to the special cases of "low-bandwidth" and "tree 
covariance" graphs. Unfortunately, such special cases do 
not commonly arise in real-world applications. For ex- 
ample, in the network monitoring problem, the topology 
and routing of real-life networks often lead to covariance 
matrices with complex structure. This motivated us to 
seek alternatives to the algorithms in |J4], 11171 . which 
can be applied to large-scale real-world applications. 

In mathematics and signal processing community the, 
so called, sparse approximation problem lfT8l . lfl9l . ll20l 
is also similar to ©. In that context, the goal is to find a 
"sparse" subset of a dictionary T> = {cpi}i£{i,...,\T>\} of 
unit vectors that span R w , so that a linear combination of 
the selected vectors best approximates a given signal A e 
M. N . Two common greedy approaches, namely matching- 
pursuit 1181 and orthogonal matching pursuit 0211 . are 
discussed in |[T9l and a two-step greedy heuristic with 
performance guarantees is proposed. 

Subset selection also appears in variable selection 
problems El . J5], l23l . that seek the best subset of 
features of a given design matrix. The problem has been 
extensively studied by the linear algebra and statistics 
community. In J3, a randomized procedure followed by 
a deterministic one is proposed, such that the K "best" 
variables that are returned by this two-phase algorithm 
capture the same amount of information as does the 
subspace spanned by the top K eigenvectors of the data 
matrix. In J23J, one commonly used technique, namely 
the Lasso method, relaxes the "budgetary" constraint 
on the number of variables to be selected by using an 
alternate one on the regression coefficients vector. By 
doing so, the integrality constraint vanishes, and the 
problem can then be solved using standard quadratic 
programming algorithms. 

Recent work on combinatorial design problems that 
minimize the uncertainty of the information of interest 
appear in lf24l . l25l . In l24l . an approximation algorithm 
based on semidefinite relaxations is proposed to solve the 
problem of finding the least squares solution s of the sys- 
tem of equations Hs = y, where s is a vector of binary 
variables and H is a matrix with bounded uncertainty. 
In l25l . the "best" subset of s out of n wireless sensors 
is sought, so as to maximize the detection performance 
of the observed phenomena. 

III. Problem Formulation 

A. A Canonical Framework 

We focus on instantaneous prediction, known as krig- 
ing. In this case, one can capture the information on 
a set of observed variables y (t) — (yi(t))i,zo, O C 
£:={!,••• ,i} (e.g. see Figures [8] and |9). The goal is 



then to predict the information carried on the unobserved 
variables y u (t) = (y e (t)) eeU , U := {1, • • • ,L} \ O, at 
the same time t. (For an example of temporal prediction, 
see J9); henceforth, we often omit the argument t.) For 
jointly Gaussian random variables, the ordinary kriging 
estimate 



Vu 



/i [t + £ U0 E 00 1 (y -/x ), 



(1) 



is the best linear unbiased predictor (BLUP) for y u via 
y , where /x y =(**») and E y = ( *T ^"° 

are the partitions of the mean and the covariance of y 
y(i) into blocks corresponding to the unobserved (u) and 
observed (o) variables. The estimation of /j, y and £ y are 
important problems in practice, which were addressed in 
l9l for the case of network monitoring. For the purpose 
of this work, we shall assume that jjL y and £ y are known. 
Assuming multivariate normality, we also havqj 



Vu\Vo ~ Jv (Vu, "M ^->uo2- J 00 ^oujj 



(2) 



where y u is given by (Q]). In this case, the BLUP y u = 
E{y u \y ) is also the mean square optimal predictor. 
Let the error covariance matrix be 

Eon- = S orr (C) := E[(y u - y u )(y u - y u ) T \yo] 
= E[(y„ - y u ){Vu - Vu) T ] 



where for the second equality we used the fact that 
for Gaussian random variables the error of estimation 
is independent to any linear or non-linear functional of 
the observations y . The objective functions considered 
in this study are: 
(i) A-optimality: 

trace(E err ) = ^ Var(y;|y ) = E\\y u - y u \\ 2 , 
leu 

where || • || stands for the Euclidean norm; 
(ii) E-optimality: 



p(Eerr) 



^err||2) 



(4) 



(5) 



where || • ||2 stands for the spectral matrix norm, i.e., 
largest eigenvalue of S err . Then, the optimal monitoring 
design problem is given by: 

Problem (Optimal Monitoring Design) Find the optimal 
set O* C L := {1, 2, . . . , L} such that: 

Z{0*)= min f(£ elI (0)), (6) 

OC£, s.t. \0\=K 

where Z{0) is the prediction error when monitoring the 
set of links O C C and /(■) is the optimality criterion 
(e.g. trace or spectral norm). 

-Henceforth, we shall assume that the matrix A has full row rank. 
Otherwise, our results can be shown to hold mutatis mutandis with 



The greedy heuristic is a well-known fast method for 
solving (O. Starting from an empty set O, this heuristic 
amounts to incrementally adding to O the link that mini- 
mizes the prediction error Z (or equivalently, maximizes 
the error reduction). Let 6j(0) = Z(O) - Z(0{J{j}) 
be the error reduction when adding element j to the set 
O. The formal algorithm due to Nemhauser et al. Ifl6l 
is as follows. 

Greedy Heuristic. 

1) Let O = 0, W° = C and set k = 1. 

2) At iteration k, select if. G Af k ~ l such that 



ik G &vgmaxdi(0 



fc-i 



(7) 



with ties settled arbitrarily. 

3) If S ik (O k - 1 ) < then stop. Otherwise, set O k = 
O k - x \J{i k } and N k = Af k - X \ {t k }. 

4) If k — K stop and output O . Otherwise, set k = 
k + 1 and go to step 2). 

The "naive" implementation of the greedy heuristic is 
not feasible for large-scale networks, since our optimiza- 
tion criteria involve the inversion and multiplication of 
matrices with dimensions in the order of several tens of 
thousands. In our problem, however, there is a natural 
geometric structure related to PCA that can be used for 
developing fast heuristics. It also leads to an exact and 
efficient implementation of the classical greedy heuristic 
which avoids matrix inversions. The connections to PCA 
also yield lower bounds on the error in (|6), which are 
of independent interest. We present these new results in 
Section [TV] following the introduction of our working 
framework for network monitoring. 

B. Large-scale Network Monitoring 

Consider a communication network of A^ nodes and 
L links. The total number of traffic flows, i.e. source 
and destination (S, D) pairs, is denoted by J. Traf- 
fic is routed over the network along predefined routes 
described by a routing matrix R = (rej)Lxj, with 
rt '.j = 1, when route j uses link £ and otherwise. Let 
x(t) = (x,-(t))/=i and y(t) - (y ( (t))f =1 , t= 1,2,- •■ 
be the vector time seriesj of traffic traversing all J routes 
and L links, respectively. We shall ignore network delays 
and adopt the assumption of instantaneous propagation. 
This is reasonable when traffic is monitored at a time- 
scale coarser than the round-trip time of the network, 
which is the case in our setting. We thus obtain that 
the link and route level traffic are related through the 
fundamental routing equation 



y(t)=Rx(t). 



(8) 



(AqA^) x viewed as the Moore-Penrose generalized inverse. time of the network. 



3 Here, time is discrete and traffic loads are measured in bytes or 
packets per unit time, over a time scale greater than the round-trip 



It is well-known that single link/route traffic traces 
exhibit burstiness over time, which can be statistically 
explained via the notions of long-range dependence and 
self- similarity (see e.g. |26|). 

In (8 ], we proposed a global mechanistic model for the 
traffic on an entire network. The traffic flow along each 
route was represented as a composition of multiple long- 
range dependent On/Off traces describing the behavior 
of individual "users". Such On/Off processes have been 
popular and successful models for the pattern of traffic 
generated by various protocols, services and applications 
(e.g., peer-to-peer, file transfers, VoIP, http, etc.). It is 
well known that the composition of multiple independent 
traces of this type yields long-range dependent models, 
that are well-approximated by fractional Guassian noise 
(see e.g. |26| ). On the other hand, using NetFlow data, 
we found that the traffic flows Xj(t) across different 
routes j are relatively weakly correlatedH] (in j). Thus, 
the routing ([8]), becomes a primary cause of statistical 
dependence between the traffic traces yg(t) across dif- 
ferent links (. in the network. In particular, the greater the 
number of common flows that pass through two given 
links, the greater the correlation between the traffic loads 
on these links. The dependence of ye(t) across "space" 
(links) i and time t can be quantified and succinctly 
described in terms of the functional fractional Brownian 
motion (see J8j). 

In |9l, we developed further statistical methodology 
that allows one to estimate (using NetFlow data) the 
structure of the means /j, x = Kx(t), the flow-covariances 
E x := E(sc(t) — fj, x )(x(t) — Hx) T and their relationship 
for all routes in the network. This leads to a practical 
factor model for the link loads y(t), which can be 
estimated online from traffic measurements of just a 
few links. The estimated model can in turn be used to 
perform network prediction, which is the topic of this 
paper. 

IV. Principal Component Analysis 

A. A geometric view of optimal prediction 

We discuss next how our problem © relates to PCA. 
The covariance matrix S y could be obtained via the 
routing equation ([8]l and the statistical characteristics of 
the J flows, i.e. E y = RY, X R T or could be directly 
available through historical data from past link mea- 
surements (for more details on estimating the covariance 
see |9l). Without loss of generality, we decompose S y 
using singular value decomposition, as E y = AA T with 
some L x J matrix A. The row-vectors of the matrix A 
will be denoted as at £ M. J , I = 1, • • • , L. 

For convenience, we let £ err (0) = E(y — y)(y — y) T , 
with y = (y u y ) T and y = (y u y ) T be the error 

4 Except in periods of congestion where the TCP feedback mecha- 
nism induces dependence between the forward and reverse flows (see 
GO and also Q). 



covariance matrix including both the observed and un- 
observed links. Using (01 one can show that S err (C) = 
2jy — n n 00 Zj , where 2j = AA , 2j 00 = A A , 
and A Q = {aij)i^o.jeJ lS a submatrix of A with rows 
corresponding to the set of observed links O. In practice, 
S y is typically non-singular. 
One thus obtains: 



S err (0) = A(Ij - A T {A A T )- l A )A T 
= A(Ij - P )A T , 



(9) 



where P a := A^(A A^y 1 A . The matrix P a is the 
projection matrix onto the space Wq '■= R&ngc(A^ ) 
spanned by the row vectors {a,; : a^ £ R' 7 ,i £ O} of 
A . Therefore, (Ij — Pq) is the projection matrix onto 
the orthogonal complement Rangc(Aj) ± . 

AppendixlAl shows that \\A— APa\\ 2 F = trace(A(7/- 
Po)A T ) and \\A- AP \\l = p{A{I,j - P )A T ), where 
|| • || p denotes the Frobeniu^l norm. These facts together 
with (O imply that the problem in © is no different 
than the combinatorial problem: 



O* := argmin 

OC.{l,-,L},\0\- 



\A-AP \\l 



(10) 



where || • ||^ is the spectral norm for £ = 2 and the 
Frobenius norm for £ = F. 

Problem J10I . however, has a nice geometric in- 
terpretation. It seeks the "optimal" subspace Wo '■= 
Rangef^Q ) such that the distances, under the Frobeniusj 
or spectral norm, of the matrix A to its row-wise 
projection APq are minimized. 

B. PCA-based lower bounds 

Observe that in dTOb . projection is restricted to the 
subspaces spanned by subsets of K rows of A. Should 
one relax this constraint and optimize over arbitrary 
A'-dimensional subspaces of R J , one would achieve a 
lower bound for the objective function. In this case, PCA 
analysis shows that the optimal space W* is given by 
Pk = VkVr where Vk = (vi,v 2 ,...,vjf) is the 
matrix of the K principal eigenvectors of A T A. Let 
A T A ~ V T DV be the singular value decomposition 
(SVD) of A T A with D = diag(X 1 ,X 2 ,...,X K ,-,Xj), 
where A, > A 2 > ■ ■ ■ X K > ■ ■ • > Xj > (see 1271 . 
[3) and AppendixlAl). The above PCA-geometric obser- 
vation readily gives the following lower bounds for the 
prediction error (see Appendix |A] for proofs). 

Theorem 1 (PCA lower bound for trace). 
J 
Y, A i = ||i4-APjf||S.<trace(E„ r (0)). (11) 

i=K+l 



5 By definition \\B\\ F := v / trace(BB T ). 
The notion of distance is even more transparent in the Frobenius 



case - see Relation 132) in Appendix lAl 



Theorem 2 (PC A lower bound for spectral norm). 

Ajf+i = \\A- AP K \\l < p(E e „(0)). (12) 

The geometric structure of the problem also suggests 
efficient heuristics, discussed in detail in the next section. 
For example, we can think of a sequential "greedy" 
method that picks the space Wq that has smallest "an- 
gle " with the space W* . Note though that the spaces Wq 
in (ITOb should be spanned by K vectors a,, i G C (i.e., 
corresponding to K observed links). Thus, in principle, 
the PCA-based lower bounds are strict, yet extremely 
useful since these bounds also hold for the exact optimal 
solution O* . Therefore, a small relative gap between the 
error of a heuristic and the PCA lower bound implies a 
good approximation to the value Z(0*) of the optimal 
solution. 

V. Approximation Algorithms 

A naive implementation of the described greedy al- 
gorithm for both|3 A- and E-optimality criteria has a 
complexity of 0(KL A ). This is because operations 
such as matrix inversion, matrix multiplication and the 
calculation of the trace or spectral norm are involved 
whenever 5i(O k ~ 1 ) is calculated (see (|7]i and Eqs.([2j>- 
©). This section presents fast, approximation algorithms 
that substantially reduce this computational complexity 
(see Fig. [7] to grasp an idea of the speedup achieved). 

Motivated by the discussion at the end of the previous 
section, one idea is to first pick a link i\ G C for which 
the vector a^ is "closest" to the first PCA component 
Vi. If K = 1, the procedure ends, and one can show - 
by (O and Theorem [3] of the sequel - that this selection 
is very close to the optimal choice where the notions of 
"close" and "optimal" depend on the type of norm or 
optimality criterion. If K > 1, we "subtract" the effect 
of the chosen link i\ by considering only the orthogonal 
projections of the a,'s for the remaining links onto the 
space spanja^}- 1 . We then construct a new matrix A 
with rows given by the above projections and repeat the 
procedure iteratively K times. 

Formally, the matrix A is updated as follows. Assume 
the iterative procedure selects link i^ at step k to be 
added to the set of selected links O. We update A with 
the rule 



J^( k ) — j\(k- 



"(' 



(fc-1) (fc-l) J 
II C*= — 1) II 2 



(13) 



with A^ = A and the row-vectors of A^ k ^ be- 
ing af~ 1] G R J ,i G {1,2,..., L}. The following 
proposition justifies this method, by establishing that the 
proposed procedure can be used to sequentially update 
the prediction error covariance S err (0). This sequential 

7 A more meticulous calculation shows that the complexity for A- 
optimality is 0(K 2 L 3 ). 



projection property is the key behind the computational 
efficiency of the proposed heuristics presented in the 
sequel, since it allows us to avoid expensive operations 
such as matrix inversions. Its proof is given in Ap- 
pendix IbI 

Proposition 1 (Sequential Computation of E crr (C)). Let 

O k be the set of selected links at the end of step k, and 
A^ the iteratively updated matrix, as shown in Eq. (fT3l l. 
Then, 

A (k) A (k) T = s err (O fc ) where (14) 

E err (O fe ) = A(Ij Al(A 0k Al)- 1 A 0k )A T . 

A. A-optimality (trace) criterion 

In our first heuristico we implement the geometric 
idea discussed above where, at step k, we select the link 
ik £ C whose row-vector a,i k is closest to the principal 
component of the current version of A. We can choose 
between two options for vector proximity; the smallest 
angle or the longest projection. We decided to work with 
the latter. We obtain the principal component using the 
power method l27l . The steps of the algorithm are: 

Algorithm 1 (PCA Projection Heuristic - PCAPH). Let 

O a = and AW = A. Set k = 1. 

1) Power Method Step: Using the power method 
obtain a fast approximation to the principal eigen- 
vector vi of A^-V 7 'A( k -V . 

2) SELECTION: At iteration k, choose: 

■ _ I T (fc-1) | 

ik G argmax \v 1 a\ '\ 
ie£\o fc - 1 

where a\' ' G M. J , i G {1, 2, . . . ,£} are the row- 
vectors of AS 1 >. Put ik in the list of links to be 
monitored, i.e., O k := O k ~ 1 U{*fe}- 

3) Projection/Error Reduction: The rows of the 
matrix A*-' are the orthogonal projections of the 
rows of AS ^ onto (span{c^ }) ■ Formally, 



AV>:=AV-*>(lj-- 



(fc-i) a (fc-i) J 



a 



(fc-i) i 



(15) 



4) Set k = k + 1. If k < K, go to step 1). 

As shown in Section IVIII this strategy usually yields 
a monitoring design with slightly larger prediction error 
than the greedy strategy, but is orders of magnitude faster 
in execution time. Specifically, step 1) requires 0(mJL) 
computations, where m is the number of iterations the 
power method executes (m <C min{L, J}). We need 
O(LJ) computations per iteration for the matrix-vector 
multiplication in the power method loop. Steps 2) and 
3) require 0(LJ) operations. 

8 This heuristic is for both A- and E-optimality. For paper organiza- 
tion purposes, it is placed under A-optimality. 



Lemma 1. The complexity of the PCA Projection 
Heuristic is 0(mKLJ). 

The next algorithm, is a very fast implementation of 
the classical greedy algorithm. It avoids calculating the 
inverse of the covariance matrix E 00 . Instead, at each 
iteration we seek the column vector that maximizes the 
error reduction. This is equivalent to finding the vector 
that maximizes the squares of the projections of the 
remaining vectors onto itself. For A-optimality, one can 
show that the error reduction at each step k, given that 
links {i\, ...,ik~i} were already chosen, is equal to: 



R k (i):=R {il ,..., ik _ l} (i) = J2 



,(*-!)„ (fe-l)|2 



.l^ll 2 ' 

(16) 
with i £ C\ {ii, ...,ifc_i}. In words, it equals the sum 
of the squares of the projections to space spanja,- }. 
This step is accomplished by the first step of the algo- 
rithm. The second step, updates the matrix A^ k >. 

Algorithm 2 (Fast Greedy Exact - FGE). Let O = 

and AM = A Set k = 1. 
1) Selection: At iteration k, choose: 



ik £ argmax > 

iec\o k - 1 j=1 

,(fc-i) 



I (k-l) 1 (fe — 1) i2 

K_ Qj T 

||„C*-1)||2 



(17) 



where a~f" '' E M' 7 , i E {1,2, ... ,L} are the row 
vectors of A^^l Set O k = O^ 1 {J{i k }- 
2) Projection/Error Reduction: Do step 3) of 
Algorithm^ i.e., 

A (k) 



A ( k -V(ij-?i 



(fe-1) (fc-l) J 



,(*:-l)||2 



(18) 



3) Set k = k + 1. If k < K, go to step 1). 



Step 1) is a "greedy step" since it picks the link that 
reduces the error the most. It requires 0(JL 2 ) operations 
while, step 2) requires O(LJ) operations after suitably 
rearranging the order of operations. 

Lemma 2. The computational complexity of the Fast 
Greedy Exact algorithm is 0{KJL 2 ). 

Remark 1. Algorithm PCAPH relies on the fast perfor- 
mance of the power method, whose convergence speed 
depends on the ratio |A2|/|Ai| of the matrix under study. 
Thus, one might think that the performance of PCAPH 
may deteriorate when that ratio is close to 1. However, 
it is not affected because: a) We are only interested 
to an approximation of the principal eigenvector so 
a few iterations of the power method suffice, and b) 
other iterative methods could be used instead, such as 
the Rayleigh quotient iteration H271I that has a cubic 
convergence speed when an approximate eigenvector is 
provided (say, from the power method). 



B. E-optimality (spectral norm) criterion 

For E-optimality, we present a very fast implemen- 
tation of the greedy heuristic. Relation © and Propo- 
sition Q] allow us to avoid computationally expensive 
operations like matrix inversion and singular value de- 
composition, which leads to drastic improvements in 
performance. The next algorithm was motivated by the 
following characterization of the largest eigenvalue 



P(^crr) : ~ Al(S err ) 



max 

z£M. L ,\\z\\=l 



z J S r 



(19) 



Algorithm 3 (Fast Greedy Randomized - FGR). Let 

O = and AM = A. Set k = 1. 

1) INITIALIZATION: Generate m independent, nor- 
mally distributed random vectors Xj £ M. L ,i = 
1,2, ... , m from Af(0, 1 if) and set Zj := Xj/||xi||. 

2) Savings Step: At iteration k, k = 1,2,..., K, 
calculate: 



c i :={^A^-^).{A^-^ T 2, i ),\H 



3) SELECTION: At iteration k, select: 



1,2,. 



, TO. 

(20) 



jk £ argmin -j max[Ci 

jec\o k 



i < maxfc 

-1 I Zi 



zfbf-^bf- 1 ^ 



(21) 
where bf _1) := A^-^af'^ and of -1) £ 
M. J ,j G {1,2, ...,L} are the column vec- 
tors of A^^ 1 ' . This corresponds to finding 
the link j that minimizes the error \\A^ k >{I — 
a 3 aj/\\af\\ 2 )A^ T \\ 2 . Set O k = O^ 1 \J{i k }. 

4) Projection/Error Reduction: Do step 3) of 
Algorithm [7] 

5) Setk = k + l.Ifk< K, go to step 2). 

In step 1), we randomly sample in unit vectors Zj : 
Zj e R L ,i = 1,2,. . . ,m, ||zj|| = 1. We use these 
vectors in step 3) to approximate the maximum in (fT9l 
and hence the largest eigenvalue. Note, that in step 2) we 
save the Cj values so as to omit unnecessary repetitions of 
the same quantity. In step 3), we also choose the vector 
(link) that minimizes the error expressed through the 
largest eigenvalue. Finally, in step 4), we update matrix 
A for use in the next iteration. Step 2) requires 0(mLJ) 
operations and step 3) 0(mLJL). 

Lemma 3. The computational complexity of the Fast 
Greedy Randomized algorithm is 0(mKL 2 J). 

The following propositions present theoretical bounds 
on the quality of approximating Ai, the largest eigen- 
value of a matrix, say S. For proofs see the Appendix. 

Proposition 2. Let Ai be the true eigenvalue and Ai 
the approximated one using the heuristic described in 
Algorithm [3] Let also to be the number of random 
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Fig. 2. Evaluation of approximation algorithms on Internet2, and comparison with the exact optimal solution. Moreover, illustration of the 
PCA lower bound, which is useful for assessing the quality of approximation when the exact solution is not known. Compare it with the loose 
(1-1/e) bound proposed by Nemhauser/Wolsey 1 151 - Of course, the bound of 1 1 5| cannot be claimed, due to absence of submodularity. 
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Fig. 3. Ensemble (distributed implementation) of FGR algorithm for Internet2 network. (Left) Ensemble of FGR yields the exact optimal 
solution 88% of the time, whereas classical greedy 76%. (Right) Link selection frequency. Links that connect West to East sites (e.g. links 
13, 14 and 9, 10 that correspond to KANS-CHIC and SALT-KANS, see Table H} are selected the most often by our randomized algorithm. 
Information from links 5, 6 is redundant, so these links are almost never selected. 



used in the heuristic. Then, for any 



vectors ■n e 

positive scalar e > 0, the following holds: 



*(£>' 



> 1 



-S(m,e,L) 



(22) 



where 



S(m, e, L) 



2mY(L/2) 



-L-l 



(L - l)0Fr(-l/2 + L/2) V 1 - e 

(23) 
with T(-) being the V function. 

When some extra information for the eigenvalues of 
matrix £ is available, the next proposition can be more 
elucidating regarding the value of m. Such information 
might be obtained by computing the eigenvalues once, 
before the algorithm starts, so as to help the designer 
grasp an idea for adequate values for m. 

Proposition 3. Let Ai be the true eigenvalue, and Ai 
the approximated one using the heuristic described in 
Algorithm [3] Let also m be the number of random 
vectors z; € M. L used in the heuristic. The eigenvalues 
of matrix £ are Ai > A2 > . . . \l, and c\ > c^ > ■ ■ • cl 
with Ci = Ai/Ai the normalized values thereof. For any 
positive scalar e > 0, there exists an index t such that 
c, > 1 — e, Vi < t and c,- < 1 — e otherwise. Then, 



P(^>1 
•Ai 



> 1 - [F t ,L- 



(L-t 



V t ct - (1 - e) 



(24) 

where F UlM2 (-) is the cumulative distribution function 
for the F-distribution with parameters v\ and V2- 

C. Ensemble methods for randomized algorithms 

Algorithm FGR can be characterized as a randomized 
algorithm. It uses m random vectors to approximate 
the largest eigenvalue Ai, needed to accomplish the 
Selection step. PCAPH may also be implemented in a 
randomized fashion; since it involves the power method 
for approximating vi, one can utilize a random vector 
for the method's initial value. This randomization gives 
rise to the idea of ensembling. 

The main idea of the ensemble method is to pick 
a small m - which makes the algorithm much faster 
- and run several, say r, independent instances of the 
algorithm, in parallel. We then select the solution set 
that yields the minimum prediction error among the 
ensemble. Note that unlike the greedy approach, the 
resulting solution sets O k are no longer nested, i.e. some 
links may be excluded from O k and replaced with others 
as K grows. This helps avoiding one of the artificial 
constraints that greedy procedures impose on the solution 
sets. 



TABLE I 
ID'S OF THE 26 LINKS OF THE INTERNET2 NETWORK. ODD LINK 
ID'S CORRESPOND TO THE UPLINK DIRECTION AND THE EVEN TO 
DOWNLINK; I.E. LINK 7 IS THE LOS ANGELES TO HOUSTON LINK 

and Link 8 is the Houston to Los Angeles link. 



Link ID 


Origin — > Destination 




1,2 


Los Angeles — > Seattle 




3,4 


Seattle -> Salt Lake City 




5,6 


Los Angeles — v Salt Lake 


City 


7,8 


Los Angeles — Y Houston 




9, 10 


Salt Lake City — > Kansas 


City 


11, 12 


Kansas City — * Houston 




13, 14 


Kansas City — > Chicago 




15, 16 


Houston — > Atlanta 




17, 18 


Chicago — > Atlanta 




19, 20 


Chicago — > New York 




21, 22 


Chicago — > Washington 




23, 23 


Atlanta — > Washington 




25, 26 


Washington — > New York 





Our experiments with the topology of Fig. [8] indicate 
that for large enough r, ensemble methods can often 
come close and, in fact, yield the optimal solution of 
Problem ©. For example, Fig. |3(a)| shows the results 
of a distributed implementation of FGR with r = 512 
and to = 20. Comparing with a single execution of the 
classical greedy (or even FGR with r = 1, m S> 20), we 
note that the exact optimal solution is obtained 88% of 
the time. The solution can be obtained in minutes, rather 
than hours, as needed when an integer programming 
formulation is used for the exact solution. Fig. |3(b)| 
shows the selection frequency of each link when FGR 
is employed with r = 1024, m = 10. It is pleasing to 
observe that the links that bridge the East with the West 
sites of the topology have the highest probability of being 
included in the optimal solution. 

Furthermore, as verified in Table |ll] randomization 
allows the network designer to adhere to smaller val- 
ues for m when the option of parallel or distributed 
implementation of the algorithm is available. Table [TT] 
shows the error of the FGR algorithm with respect to 
the optimal solution, for the Internet? network and link 
budget K = 14. We ran several versions of FGR with 
different values for random vectors m and amount of 
parallelization r. The results were run in Matlab on an 
8-core computer. Note that the advantages of paralleliza- 
tion are not entirely revealed in our example since, in 
most cases, r is much greater than the number of cores. 
Thus, it was rather a distributed implementation than a 
parallel one. Nonetheless, the proposed ensemble method 
is inherently parallelizable. Should enough resources are 
present, a very brief time is needed for computing a 
monitoring set with less prediction error than the "naive" 
implementation of the greedy heuristic. For example, 
for to = 64 and r = 256 we get a relative error of 
4.8%, whereas greedy gives 8.5%. If we had 256 cores 
available, then the computational time for that (to, r) 
pair would not exceed the 7 seconds that our 8-core 
machine needed for running FGR with m = 512, r = 8, 




Fig. 4. Visualization of the geometric interpretation of PCA analysis. 
P y (x) refers to the projection of vector x to vector y. 



VI. Performance Guarantees for Error 
Reduction 

This section introduces performance guarantees for 
the error reduction (see Eq. ([TBI ) that can be achieved 
by algorithms PCAPH and FGE when the criterion 
is A-optimality. Note that the (1 — 1/e) bounds for 
greedy algorithms developed by Nemhauser/Wolsey [15] 
cannot be claimed, since the submodularity property 
does not hold for our objective functions. For proofs see 
Appendix [C] 

Theorem 3. Suppose that at iteration k the observed 
set O = {?i, ..., ik}, and the resulting matrix is A' '. 
Let also a\ be a candidate vector for selection by the 
algorithm on step k + 1. Then, the error reduction (see 
Eq. ( TTol l) at step k + 1 is bounded below as follows: 



IqW qWp 



E 



,W||2 



> 



(25) 



(7 «)2 A W _ 2 7 ( fc )^l-( 7 W)2 v /A( fc 




> 



where A, is the jth largest eigenvalue of A^ ' A^ k > 
(i.e., after we have updated our matrix at the previous 
step k, see ( 113l l), and 7^' is the cosine (i.e., see angle 
8 of Fig. |?P between the principal eigenvector v\ of 
^(fc) j^(k) anc j tne se i ecte( i vec tor a.- , i.e. 



Jk) 



M 1 „(*) 

n„( fc )|| 



(26) 



Theorem 4. Suppose that the conditions of Theorem \3\ 
hold. Then, at iteration k + 1, the error reduction is 
bounded from above as follows: 



10 TABLE II 

ENSEMBLING FOR VARIOUS VALUES OF m AND r ON INTERNET2 NETWORK FOR K = 14. WE RAN 50 REPLICATIONS FOR EACH (m, r) 

PAIR TO OBTAIN THE RELATIVE MEAN ERROR (RME) W.R.T THE EXACT SOLUTION, AND 95% CONFIDENCE INTERVALS. CLASSICAL 

GREEDY'S RME IS 8.5%. THE EXPERIMENTS WERE RUN IN MATLAB ON AN 8-CORE COMPUTER. 



(m,r) 


32000,1 


1024.4 


512,8 


512,16 


256,16 


128,32 


64,256 


32,512 


32,256 


8,512 


4,1024 


16,1024 


95% CI (lb) 


9.8 


8.3 


8.5 


8.0 


7.8 


7.6 


4.4 


5.1 


6.4 


7.8 


7.4 


7.8 


RME (%) 


10.2 


8.4 


8.5 


8.2 


8.0 


7.8 


4.8 


5.4 


6.6 


8.0 


7.6 


8.0 


95% CI (ub) 


10.6 


8.6 


8.5 


8.3 


8.1 


8.0 


5.1 


5.7 


6.9 


8.1 


7.8 


8.1 


Time (sees) 


169 


7.5 


7.0 


13.4 


11.8 


21.7 


185.2 


419.8 


183.2 


417.7 


1021 


1029 



L la™ ai fc) l 2 



E^ 



,(*) 



<A^ ) +( 7 (fc) ) 2 Af ) + (27) 



./ = ! 



+2 7 «y / l -( 7 (fc) ) 2 \/A] 






where A,: is the jth largest eigenvalue of AS ' AS ' 
("i.e., after we have updated our matrix at the previous 
step k, see dl3t ), and 7' > is as defined in Eq. 



Fig. [5] demonstrates the bounds when the FGE algo- 
rithm is run on a network with N = 100 nodes and 
L = 195 links (see section IVII-CI for more details on 
that network). 

Remark 2 (Interpretation of Theorems [3] and 2). Say 

that the algorithm selects the link that corresponds to 
vector a) which is at the same direction as v\ , 
Then, ^ ' = 1 and our bounds tell us that we will 



(k) 



have a significant reduction, with value between X\ 
and X\ + A 2 . On the other hand, if the algorithm 
selects a vector with 7^ ' = 0, then the best reduction we 
should expect to achieve is A 2 . This is because we are 
essentially selecting from vectors that are perpendicular 
and the "favorable choice" would be some 



(k) 



to v 

vector that is co-directional with the second principal 
axis v 2 . For all other values of"f( k >, the error reduction 
is in between the values given by the theorems, taking 
into account, of course, that it should be non-negative 
and not greater than the error reduction given by the 
PCA bound of Theorem [7] 

Remark 3 (E-optimality bounds). For E-optimality error 
reduction bounds we can claim the result of Theorem [3] 
This says, that at iteration k + 1 we can expect an 

(k) 

error reduction at most equal to X\ (see proof of the 
aforementioned theorem in Appendix \A[. In addition, we 
have the trivial lower bound that error reduction should 
be non-negative. At this moment, existence of tighter 
bounds remains an open problem. 
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Fig. 5. Illustration of lower and upper bounds offered by Theorems [5] 
and [4] for the N = 100 nodes, L = 195 links network. 



VII. Performance Evaluation 

This section evaluates our algorithms under a variety 
of scenarioo We start with a short discussion for the 
choice of budget K, and then we perform various 
comparisons of the proposed algorithms for different 
network sizes and topologies. Then, we employ our 
approximation methods to a real-world kriging applica- 
tion, demonstrating that the efficient performance of our 
algorithms allows for online implementation even when 
network conditions, like link importance, rapidly change. 



A. Choosing the "budget" K 

Thus far we have assumed that the number of links 
to be monitored, namely the "budget" K, is known to 
the network operator. However, in practice, this is an 
unknown parameter dependent on the monetary budget 
available. We now address the question of choosing 
an adequate value for K, so that the network operator 
can spend the least amount of money while monitoring 
the network with sufficient accuracy. The answer comes 
from PCA via the spectral decomposition of the matrix 
A T A. One wants to choose K according to the "spec- 
trum" of the data. Specifically, K should be such that 
(see Section 6.1 J28l for more details and other criteria): 



K 



Y^ A, > 0.80 x tra,cc(A T A), 



(28) 



i=\ 



where (Ai > ••■ > A p ), p — min{L, J}, are the 
eigenvalues of the matrix A T A. 

'Unless mentioned otherwise, all experiments were run in Matlab 
on a 2.4GHz, dual-core computer with 4GB memory. 
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Fig. 6. Kriging on Internet2. The weighted-error case. (Data for 03/17/09.) Note the steep error drop (f» 100%) between the two predictions 
around the 10th hour. 



Fig. |6(a)| shows the "spectra" of three signals, based 
on the Internet2 traffic data obtain on March 17, 2009. 
We plot the case of uniform priority, the case where links 
24 and 25 are assigned weights wi = 10, and the case 
where wi — 20 for I = 24, 25. (More details on weighted 
link monitoring follow on subsection IVII-FI ) Note that 
as the importance of links increases the "energy" of the 
signal is concentrated on fewer singular values. This is 
intuitively appealing since it suggests that the prediction 
error will be reduced the most if we can afford a budget 
K that is at least as large as the number of the high 
importance links. 

B. Greedy Vs. Exact algorithms 

We use a real-world network, namely Internet2, to 
evaluate our approximation algorithms against the exact 
solution obtained via an exhaustive search. We also cal- 
culate and demonstrate the PCA lower bound. Internet2 
(formerly Abilene) involves L = 26 links, N = 9 nodes 
and J = 72 routes (see (9), ifTTl ). For simplicity, we 
assume that the flow-covariance matrix is S^ = Ij (see 
also 0, 0). In Fig. |2(a)| we examine the trace criterion 
and in Fig. |2(b)| the spectral norm one. In both cases, 
the exhaustive search required several hours to converge 
to a solution (the implementation was done in Matlab 
using CPLEX). On the contrary, the heuristics (PCAPH 
and greedy algorithms) terminate in a few seconds and 
yield a solution very close to the optimal one. 

C. Comparisons of Approximation Algorithms 

We next juxtapose the computational performance of 
our fast approximation algorithms against the classical 
greedy heuristic. We run our simulations on a moderate- 
sized network of N — 100 nodes generated using 
preferential attachment, as described in [29 j. We assume 
J source-destination (S, D) flows with identical traffic, 
such that Xj ~ Af(fi, l),j € {1, 2, ..., J}. The route for 
each (S, D) pair is chosen using the Floyd-Warshall 
shortest path routing algorithm. We created a network 



with a routing matrix R of L = 195 links and J = 500 
flows. 

Fig. |7(a)| illustrates the results for A-optimality. Our 
proposed heuristics are notably faster than the naive 
implementation of the greedy algorithm. Specifically, 
PCAPH is 10 x faster than classical greedy, and FGE 
is 2x faster. The PCAPH algorithm significantly outper- 
forms all other algorithms at the expense of having a 
slightly larger prediction error. Fig. [7(b)| depicts the case 
of E-optimality. Again, our algorithm performs signifi- 
cantly faster (20 x faster) than the naive implementation 
of the greedy heuristic. We used m — 100 random 
vectors for the FGR algorithm. In both figures, we use 
the PCA lower bound to qualitatively assess the solutions 
of the algorithms, since obtaining the exact solution is 
not computationally feasible. 



D. Scaling Up 

In this section, we evaluate our algorithms in a large- 
scale network constructed via the topology simulation 
tool Inet (30). The generated network has N = 3050 
nodes, and we created a matrix A with L = 4800 
links and J = 1000 flows. The naive implementation 
of greedy was not executed for this scenario; based on 
its inferior computational performance observed earlier 
(see Fig. [7), such execution would be impractical. Its 
theoretical complexity is two orders of magnitude slower 
than our new heuristic and one order of magnitude 
slower than the efficient implementation of greedy. Ta- 
ble |nl] shows the running time and the relative error of 
our algorithms with respect to the PCA lower bound. 
Observe that the PCAPH (ran with m = 100) and FGR 
(m = 10) terminate extremely fast to a near-optimal so- 
lution. Actually, FGR yields the optimal solution, which 
in this case attains the PCA lower bound. Thus, both 
algorithms are appealing and robust methods for online 
optimal monitoring design in large realistic networks. 
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Fig. 7. Prediction error on a network with 100 nodes. Note the dramatic decrease in computation time between the naive greedy and the 
proposed algorithms of this paper. 



TABLE III 
INET TOPOLOGY. N = 3050 NODES, L = 4800 LINKS. K = 100. 



Heuristic PCAPH 



FGE 



FGR 



Time (sees) 54 2760 (46min) 546 (9.1min) 

Rel. Err. (%) 1.03 0.99 




Fig. 8. Optimal monitoring on Internet! network. The links in bold 
red lines are the "best" links to monitor. More details on this real-world 
application in Section Ivnl 



E. Network Kriging in practice 

We illustrate next the importance of optimal selec- 
tion applied in the context of network prediction. We 
used the real-world data collected from the Internet2 
network (see 0, EH) on March 17, 2009. We uti- 
lized the PCAPH algorithm to calculate the optimal 
set of links to be monitored. Fig. |9(a)| depicts the 
true traffic on link from CHIC (Chicago) to WASH 
(Seattle) and the predicted traffic when the optimal 
set of links is used. As Fig. [H] shows, this includes 
links KANS^CHIC, CHIC^KANS, LOSA^HOUS, 
HOUS->LOSA and SALT^KANS. As one would ex- 
pect, an optimal global view is attained by monitoring 
links that connect network sites between West and East 
(see also 13). Fig. |9(b)| shows the empirical relative 
mean squared error (ReMSE) for the whole network 
on the day of interest. We compare the quality of 
prediction when monitoring: a) a non-optimally chosen 
set, b) the optimal set used throughout the day, and c) 
the optimal set, periodically recalculated every 8 hours. 
The last method accounts the newest history available, 



dynamically re-estimates matrix E y and calculates the 
new optimal set. ReMSE is defined as, 

ReMSE{t) = £>(*) ~ 2/z(t)) 2 )/5>(t) 2 ). 



leu 



leu 



The ReMSE time average is 0.09, 0.07 and 0.056, 
respectively. This clearly shows the advantages of using 
fast approximation algorithms that can be dynamically 
used for optimal link monitoring. In the following 
subsection, we show that the same algorithms can be 
employed for solving the weighted monitoring design 
problem, arising when the importance of monitoring sites 
is not uniform. 

F. Weighted Link Monitoring 

Our methods can be easily adjusted to handle cases 
where the relative importance of the links is not uniform. 
Such weighted design is particularly useful when net- 
work conditions drastically change in a dynamic manner. 
Recall that in the trace criterion case we have: 

trace(E[(y-y)(y-y) T ]) 
= E(trace[(y-y)(y-y) T ]) 
= E[||y - y\\ 2 ] = E[(y - yf(y - y)]. (29) 

The weighted monitoring design problem aims to 
minimize 

L 

E[(y - y) T G(y -y)] = J2 w ^ - w) 2 ' (30) 

where G := diag(iui, . . . , wl) is the matrix assigning 
the link weights. After some algebra we get: 

E[(y-y) T G(y-y)} 

= n{G l/2 y-G^ V f{G^y-G^y)] 

= E[(y G -y G ) T (y G -y G )], (31) 

where y G := G x l 2 y and y G := G x l 2 y. This shows that 
the trace-optimal solution y G with the new "routing" 
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Fig. 9. Kriging on Internet2. (Data for 03/17/09.) 
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matrix A G := G 1 ! 2 A optimizes OTb - We can apply our 
fast algorithms to approximate the standard trace-optimal 
solution y G . The optimal predictor for the weighted trace 
criterion in (|30t is then obtained by y = G~ x l 2 y G . 

Fig. |6(b)| shows the (weighted) prediction error {i.e., a 
weighted ReMSE) for the scenario where the importance 
of monitoring links WASH -> NEWY and WASH -> 
ATLA elevates. We can model this by assigning unit 
weights to all other links, and a weight of ten to the 
important ones. Based on Eq. (l28l l and the spectrum 
shown in Fig. |6(a)| we select a budget of K = 7. The 
solution given by PCAPH (which actually coincides with 
the exact optimal solution) includes the high importance 
links, plus links LOSA - HOUS (both directions), KANS 
- CHIC (both directions) and SALT -> KANS. Fig. [6(5)1 
clearly shows that the traffic prediction based on this set 
of links is way more accurate than the one based on a 
randomly chosen set (that includes the important links 
too). 

VIII. Discussion 

In conclusion, large-scale optimal monitoring is an im- 
portant, yet computationally challenging problem, suit- 
able for applications including anomaly detection in 
communication networks and environmental monitoring 
using sensor networks. We developed fast algorithms for 
approximate solutions that can be applied to large scale 
networks in real time. Randomized implementations of 
the algorithms in a distributed or parallel fashion can 
even yield the optimal solution in a fraction of the time 
needed to obtain exact solutions using integer program- 
ming techniques. Moreover, the novel PCA-based error 
lower bounds are practical, network-specific alternatives 
to existing theoretical lower bounds (see 03]), and 
help us assess the quality of approximation. We discuss 
how budget-selection can be achieved via PCA, and 
have applied our algorithms in numerous scenarios. The 
conclusion drawn is that our methods highly outperform 
traditional greedy techniques (see J4]), and hence are 
amenable and more scalable for online implementation 
in dynamical environments. 



Appendix A 
Principal Component Analysis theorems 

Let Y, y = AA T be the covariance matrix of the vector 
of all links y. SVD of A yields A = UB X I 2 V T , with 
D 1 / 2 := diag(t7i, . . . , a p ), p = min{L, J}, where <7j's 
are the singular values of A. Thus, (Ai > • • • > X p ) = 
(o*i > • ■ ■ > <7p ) are the eigenvalues of the J x J matrix 
il = A 7 A. Recall that the columns of A T are denoted 
by an e R J J = 1, • • • ,L. 

Theorem 5 (Trace). Let Pw denote the projection 
matrix onto a sub-space W < IR" 7 . Then, 



mm 

W<R J , dim(W)=K 



L 

Ei 
1=1 



ae- P w ai\\ 2 = 



p 

E A 3 
j=K+l 



A, 



where the lower bound is achieved for the sub-space 
W* = span(vi,--- , Vk) of the eigenvectors corre- 
sponding to the largest K eigenvalues of fl. 

The proof of the theorem is given in J9)- 

Proof of Theorem Q} We will use Theorem [5] 
Recall that S err (0) = A(I - Po)A T , where P a := 
AT(A AT)~ 1 A . The projection matrix (I — Pq) is 
symmetric and indempotent {i.e., {I—Pq) = {I — Po) 2 ), 
so we have S err = [(/ - P a )A T ] T [(I - P a )A T ], and 
therefore 



trace(S err (C)) 



L 

Ei 

e=i 



(I-Po)a e \\ 2 = J2 dist(o/, Wo) 2 



(32) 

where Wo '■= span(a£, I £ O) is the sub-space 
spanned by the vectors at corresponding to the observed 
links. The vector (I — Po)a,( is the "perpendicular" 
dropped from point ai to the hyperplane Range(A^). 
Hence ||(7 — Po)a£|| is the distance from ai to 
Range(A^) = span(a^, I e O), where A 1 * = (a^ G0 . 
Using Theorem [5] we see that the sum of (l32l is 
minimized when the projection matrix Po equals Pk = 



VkVk- Hence, 

tracc(£ crr (e>)) = tracc[yl(/ - P )A T ] 

v 
>\\A-AP K \\%= J2 A - 

i=K+l 

■ 

Similar result holds for the spectral norm case: 
Theorem 6 (Spectral norm). 

min \\A - B|| 2 = \\A- AP K h (33) 

rank(B)=if 

w/iere Pk is the projection matrix Pk = VkVk- The 
columns of matrix Vk are the top K right singular 
vectors of A J.ejB Pk = V r diag(l^, Ol_ K )V T . 

For the proof of this result, see Theorem 2.5.3 in ||27]. 
Proof of Theorem [2} Using Theorem [6] we need 
to show that: 

p(Serr(0)) = p[A{I-P )A T ] > \\A-AP K \\l = Xk+1- 

We first calculate the lower bound when Pk = 
VkVk- We use the SVD of A, A = UD 1 /^ and 
the projection matrix / — Pk = Fdiag(0^, ~^l-kW T ■ 
Thus, we have 



For notational simplicity, let O := O k , i.e., we drop 
the subscript k of the matrix A 0k . It is easy to see that 
the sequential updates of the matrices A^ in ( [T3l l can 
be represented in matrix form as follows: 



A (k) =A P 1 P 2 ---P k , 



(35) 



where 



Pi=I- 



,P k = I~ 



(fe-i) (fc-i) J 



,( fc -!)| 



A - AP K = UD 1/2 V T Vdiag(0 K , 1 



L-K 



w 1 



= UD l / 2 ^g(0' K ,ll-K)V T 

= Udi&g(0 K ,<TK+i, ■ ■ ■ ,o- L )V T . 
By the definition of spectral norm: 

|| A - AP K \\l = P ((I7diag(0£,<rx+i, . . . ,a L )V T f 
x[/diag(0^, (tk+1, ■■■, °l)V t ) 
= p(Udiag(0^,AK+i,. ■ • Al)V t ) 
= Ajsr+i. 

Consequently, using also Theorem [6] we obtain: 

p(Z eir (0)) = p[A(I - P Q )A T ] 

= p[(A-AP )(A~AP ) T ] 
= \\A-APo\\\ 
>\\A-AP K \\l 

= A_Ff+i. 



Appendix B 
Proposition Proofs 

Proof of Proposition [7J The proof resembles the 
Gram-Schmidt orthogonalization procedure. We will 
show by induction that 

A (k) A (k) T = A{1 _ Al{A A r )- 1 A o )A T . (34) 

10 We symbolize the vector of ones of dimension k as lj. and the 
vector of zeros as Ofe. 



(k) J 

Here a\ € R denote the rows of the matrix 
A^ k \ where by convention A^ = A. Observe that 
Pk is the orthogonal projection matrix onto the space 
(spanja^ ~ }) ± , i.e. the orthogonal complement of the 
one-dimensional space spanned by a\ . Here a\ 
corresponds to the link ik added to the set O on step k. 

This shows that on the fc— th step all the rows of the 
matrix A^ are orthogonal to spanja^,--- , aj fc } = 
spanjaj- , • • ■ , a\ }. The last subspace is generated 
by the vectors corresponding to the links {ii,--- ,ik} 
added on the first k steps. 

Now, to complete the proof, it is enough to show that 



P P Pi = P 

r\r2 ' ' ' *k - r span(ai 1 ,••• ,a^ 

Indeed, we have that 



(36) 



1 A [A A ) A , 



* (span{ai 1 ,-■■ .a^, I)- 1 - 



where A^ = (a^,-- ,a,i k ). Therefore, by (1331 and 
d36l . one obtains d34h which yields (TPfl i. 

We now prove (l36l l by induction. 

Induction Basis: Relation d36t trivially holds for k = 1. 

Induction Hypothesis: Suppose that (f36t holds. 

Induction Step: We will show that d36l l holds with k 
replaced by k + 1. 

(k) 

Note that by the induction hypothesis a\ ' 
is the orthogonal projection of a-i k+1 onto 
(spanja^, • • ■ , a^})- 1 . Therefore, 

span{a n ,--- ,a ik ,a ik+1 } 

= spaii{a n , • • • , a lk \ @ span{ a^jj, 

where © denotes sum of orthogonal subspaces of M. J . 
This shows that 



span^a ^ 

r L l fc+l J 



-fspan{a il ,---,a ifc ,a ifc + 1 } — Pspanfa^ ,■■■ ,a ifc }+ " spall { a W 

Since Pv^-l = I — Pw, we obtain 



P, 



(spanfa^ ,••• ,a ifc ,Oi })^ 



(37) 



■* Pspan{ ail , ■■■,«; } "span/a!') }• 



(fe) 



Note, however, that since a\ ' _L spanja.^, • • • , a,i k }, 
we have ^pan{a, 1 ,... I a i)e }-P p + l {a C*) } = 0, and the 
right-hand side of (137b equals 

U - "span{a, r -,a, jjU _ span{a (fc) ' 



P 



P 



span{ai 1 ,---,o Ifc } i - r span { (»; }J 



,(«0 



This, in view of the induction hypothesis and d37l ). 
implies that 



P 



(span{a il ,--- ,a ifc ,a ifc+1 })-■- r l 



PfcP 



fc-tfc+1, 



which completes the proof of the induction step. ■ 

Proof of Proposition [2} By definition, Ai is the 
following random variable: 



Ai :=max.{z'['Ezi,...,z'£ l Y;z m }, 



(38) 



where Zi <E R L , i = 1, ...,m are hc/ random variables 
distributed on the unit sphere S *. Let the SVD of 
E be E = VZ?F T , with the columns of V being the 
singular vectors of £ and D = diag(Ai, ..., Al,)- We 
use the fact that for every orthogonal matrix Plxl the 
rotated vectors Wi = Pzi, i = 1, ...,m are also iid 
random variables on the unit sphere S L_1 . Then, for 
P = V T , 

Ai = max{ w J PY*P T Wi} = max{wf Dw{\ 

i i 

= max{Aiw 2 (l) + A 2 w 2 (2) + . . . + \ L w 2 (L)} 



w.p. 1 

> max{Aiw 2 (l)}, 



(39) 



where Wi(k) is the fc-th component of the vector Wi. 
Then, 

P(Ai > Ai(l - e)) > P(max{Aiu; 2 (l)} > Ai(l - e)) 

i 

= l-P(max{ W 2 (l)}<(l- £ )). 

(40) 

A moment's reflection shows that the distribution of 
the r.v. w 2 (l),\/i is identical with the distribution of 
z 2 (l),Vi. By construction, for all i 



*f(l) 



*?(1) 



(41) 



x 2 {l) + x 2 {2) + ... + x 2 {L) 

where Xi(j), j = 1,...,L, i = 1, ..., m are //a? r.v. from 
the normal distribution A/"(0, 1). 

It is well known (see BTI . Section II. 3) that if 
X\ , X2 , • ■ • , X n are mutually independent Gaussian r.v. 
with expectation and variance a 2 , then X 2 + X\ + 
. . . + X 2 follows the Gamma distribution with param- 
eter a = l/(2cr 2 ) and v = n/2. Therefore, for all 
i = 1, . . . ,m 



x 2 (l)-Gamma(-,- 



*i(2) 



+ .t 2 (L) ~ Gamma(-, 



2'2 J 
1 L-l 



(42) 



)■ (43) 



It is also known that if X, Y are random variables 
with distribution Gamma(a, v) and Gamma(/3, v), re- 
spectively, then the r.v. X/(X + Y) follows the Beta 
distribution Beta(a,/3). Hence, for all i = 1, ...,m 



*?(1) 



/I L-l 
Beta(-,— — ). 



x 2 (l) + a; 2 (2) + ... + :r 2 (L) 

Continuing from Eq. (l40l 

P(max{w 2 (l)} < 1 - e) = [P(w 2 (l) < 1 - e) 

= (| e f(x)dx)" 1 ={l- j f(x)dx) m 



(44) 



< e 



- m (fl- e f( x ) dx ) 



(45) 



where /(x) is the density function of the Bcta(l/2, (L — 
l)/2) distribution, and in the last inequality we used the 
fact that 1 - x < e~ x , Vx <G R. 

We now bound the integral in the exponent. 



1 



T(L/2) 



L (l-x) /5 - 1 dx 



< 



r(i/2)r((L-i)/2) 
r(i/2) 



! (1 - x)^~dx 



l-e 



^T(-l/2 + L/2] 



(1-sY 



l-e 



L-3 

(1 - x) 2 dx. 
(46) 

After calculating the integral and some algebra, the result 
follows. ■ 

Proof of Proposition \3\ By definition, Ai is the 
following random variable: 



Ai := max{ 



xi 



x i 









-}, (47) 



with Xi <G R , i = l,...,m are independent ran- 
dom vectors from the multivariate normal distribution 
J\f(0,I L ) Let the SVD of E be E = VDV T , with 
the columns of V being the singular vectors of S and 
D = diag(Ai, ..., Al). We multiply each vector x, with 
the orthogonal matrix P, with P = V T to get Wi = Pxi. 
Since this operation preserves inner products, angles and 
distances, the vectors Wi are also iid from the Af(0, II) 
distribution. Hence, 



Ai =max{—i-rVDV 1 



w 



Wj 



} = max{— ^-rrD— -\} 



max{ 



EU^Uj) 



Wi 



} = Ai max{ 



E,=i c i^?(i) 



Wi 



}■ 



(48) 



Given e > 0, the precision of approximation is: 
P(Ai > Ai(l - e)) = P(max& > 1 - s), (49) 

i 

th the random variables £j = $^ =1 Cjwf (i)/||wi|| 2 , 
, m being independent and identically dis- 



i = 1, 
tributed. 



Proceeding, we have 
P(max£i > 1 — e) = 1 — P(max£j < 1 — e) 

i i 

= l-[P(Ci<l-£)] m , (50) 

with £i chosen without loss of generality. Explicitly 
writing ^ we obtain, 



have 7 2 + ||a^|| 2 = 1. Using the Cauchy - Bunyakovsky 
- Schwarz inequality we then bound the term | (a,j , a^~)\ 
as \(a,j,a^-)\ < \\aj\\\\a-j-\\ = \\a,j\\ \Jl — 7 2 . Therefore, 



x 2 j > 7 2 y| - 27^||a J ||Vl-7 2 



(53) 



Now we sum over all j € C to get the error reduction 
for all links. We thus get, 



P(£i<l-e)=P(£c^(j)<(l-e)I>?(i) 

3 = 1 3=1 

= p(j2[c J -(l-e)]w 2 1 (j)< J2 [(l-e)-c> 2 (j) 

3=1 J=t+1 

t L 

<p(^[ Cj -(1-s)K(j)< E (i-eKC?) 



L L L 

E ** -^ 2 E y| - 2 7 Vl-7 2 E^H^ i 

3 = 1 3=1 3=1 

= 7 2 A« 



(54) 



3 = 1 



3=t+l 



I 
3 = 1 



3=1 3=*+l 

(51) 

where we used the fact that a > 1 — e for all i < t, and 
Cj < 1 — e otherwise, to obtain the last two inequalities. 
The random variables Ylj=i w iU) nave chi-square 
density with i-degrees of freedom (see PP . Section 
II. 3). Similarly, he random variables ^_ t+1 w\{j) have 
chi-square density with (L — t) -degrees of freedom. 
Using the fact (see |3T1 , Section II. 3) that the random 
variable F = (X / v x ) / {Y / v 2 ) - with X,Y being chi- 
squared distributed with v\ and V2 degrees of freedom 
respectively - has the F-density with parameters v\ and The result follows. 
V2, the result follows. 



2 7 Vw r E%K-H' 

3 = 1 

where in the last equation we used TheoremQ]for K = 1 
on matrix A^; thus, the PCA error reduction equals the 
largest eigenvalue, A^ ' . 

We now isolate the term X) 7 =i 2/3'll a 3'l an d using 
again the Cauchy - Bunyakovsky - Schwarz bound we 
get 



< 



£to) 



A 3 = 1 A 3 = 1 



Ek 



(55) 
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Appendix C 
Error Reduction Bounds 

Proof of Theorem^ For ease of notation, we drop 
the superscript k. For the same reason, we introduce the 
operator (ei, e.2) to represent the inner product between have 



Proof of Theorem @- Again, when there is no 
ambiguity we drop the superscript k. Let Xj := 
|(Oj,ai)|/||oi|| be the length of the projection of any 
vector a,j to the selected by the algorithm vector a^. 
Also, let i/j := \{o>j,v\_ )\ be the projection length on 



As before aj/||aj|| = tv^' 



(fe) 



two vectors. Let i,- 



(aj,a,i 



\di\\ be the length of 



the projection of any vector a_, to the selected vector a,. 
Also, let yj := \{a,j,v\ ')\ be the projection length on 



z 2 = |<a,,7v( fc) 



a-f. Proceeding, we 
(56) 



= l7(«j 



,( fc )\ 



{aj,a d 



Note that ai/\\ai 



(a i /||a i ||,v 1 (fe V, (fe) 



7V^' V 4- aj-. Proceeding, we have 



c? = Ka i)7 vi fc) 

= \j(a3A k) ) 

r (k) 



at = 



(52) 



= 7 2 K,v( fe) ) +2 1 {a J1 ^){a J ,ai) + (a,,a,^) 2 
= 7 2 y? + 2 1Vj {a ,ai) +{a 3 ,at) 2 

< 7 2 y 2 + 2 7% ||a,||v / l-7 2 + (a J ,^) 2 



{a :j ,a z 



>|7|(a,-,vn|-|(a3:«,- 



< 7 2 2/ 2 + 2 7 \Ai %) 



>l 2 (a,A k) ) 



2 7 |(a„vi 



W\ 



EAfv/T^^+K,^ 



\£ 



W.«i 



(57) 



7 2 2/ 2 -272/3l( a 3: a 4 J 



with the last two lines obtained using the Cauchy - 
after substitution of the projection length yj on the first Bunyakovsky - Schwarz inequality on the specified 
principal axis. Observe that due to orthogonality, we terms. 



Note that for 
[ k) . Then, (0\ 



> 0, 



- ^,( fe ) 



ay 



i.e. a^ 



at, ay, 



= (a^, ay 1 ) because 



a./- ± Vj as well. Moreover, notice that 

|<aj-,a^)|/K„ 

construction of II a 



(at, at 



< 
because \\a^\\ = \/l — 7 2 < 1 (by 






). These steps suggest that: 



< 



E(°i>v 



(fe)\ 

2 / 



A 



(A-) 



(58) 



This result follows from PCA. It basically says that the 
sum of the squares of the projection lengths of any vector 
is bounded by the sum of the squares of 



aj- on aj- 



(A-) 



the projections on the second principal axis, v 2 . In 



f 2 



other words, v, gives the maximum projection with 



respect to any other vector from Null(v^ ). Summing 
both sides of $5% over all j = 1, ..., L, and using 
concludes the proof. 



Appendix D 

NP-HARDNESS 

We prove NP-hardness using reduction from the NP- 
hard problem Subset Selection for Regression ifTTl . 

Definition 1 (Subset Selection for Regresssion). Given 
matrix C, vector b and k, select a set Sofk observation 
variables that minimize the mean square prediction error 
Err(Z, S) = Var(Z) - b^C^bs- 

Z is the predictor variable, and X±,...,X n are the 
observation variables. The covariance matrix of the ob- 
servation variables is denoted by C. Vector b denotes 
the co variances between Z and JQ, i = 1, ..., n. 
Proof of NP-hardness: 

Clearly, the decision version of our problem is in NP 
This is because given a candidate solution O, we can 
decide in polynomial time whether or not the prediction 
error using the given set O is greater or less than a 
given value. We will use the method of Restriction lfl4l 
to show that every instance of the problem in IfTTl is a 
special case of an analogous to __) problem. 

Consider the problem of selecting a subset O of links 
in order to best predict the traffic at a specific link i, 
i $l O. W.l.o.g, say that i = 1, i.e. we want to best 
predict traffic at link 1. Then, the optimization problem 
is to select Oc£\{l} that minimizes the following: 

trace(£ u - Ei, E-_E 0t i) or (59) 

p(S u - Ei, £-£! 0l i) (60) 

which correspond to our A- and E-optimality criteria, 
respectively. But the quanity in parenthesis is a scalar, 
so the trace and spectral norm operations are redundant. 
Going back to the subset selection for regression we 
observe the following: Variable Z corresponds to traffic 



at link 1, i.e. it corresponds to variable y\. The vector b 
corresponds to the first column (or row) of matrix E, i.e. 
E : i. Matrix C corresponds to E^, where £ = C\ {1}. 
Hence, using the notation of our problem, the problem 
in IfTTl aims to minimize, by choosing a set O with \0\ = 
k, the error En — Ei E~qE 0j i. Thus, every instance of 
the NP-hard problem in [17| is a special case of our 
problem. Because our problem is in NP, it is then an 
NP-hard problem. ■ 
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